Scenario: A company approaches you to predict data scientist salaries with machine learning.
December 4, 2017
Machine learning is a method for teaching computers to make and improve predictions or behaviours based on data.
Kaggle conducted an industry-wide survey of data scientists. https://www.kaggle.com/kaggle/kaggle-survey-2017
Contains information from Kaggle ML and Data Science Survey, 2017, which is made available here under the Open Database License (ODbL).
library('mlr')
library('ggplot2')
set.seed(42)
# Regression task: predict salary from the survey data
task = makeRegrTask(data = survey.dat, target = 'CompensationAmount')
# Random forest with permutation-based feature importance
lrn = makeLearner('regr.randomForest', importance = TRUE)
mod = train(lrn, task)
"There is a problem with the model!"
# Individual conditional expectation (ICE) curves for Age
ice = generatePartialDependenceData(mod, task, features = 'Age',
    individual = TRUE)
plotPartialDependence(ice) + scale_y_continuous(limits = c(0, NA))
Goldstein, A., Kapelner, A., Bleich, J., & Pitkin, E. (2015). Peeking Inside the Black Box: Visualizing Statistical Learning with Plots of Individual Conditional Expectation. Journal of Computational and Graphical Statistics, 24(1), 44–65. https://doi.org/10.1080/10618600.2014.907095
# Centered ICE curves: anchor all curves at Age = 20
ice.c = generatePartialDependenceData(mod, task, features = 'Age',
    individual = TRUE, center = list(Age = 20))
plotPartialDependence(ice.c)
# Partial dependence plot (PDP): average of the ICE curves
pdp = generatePartialDependenceData(mod, task, features = c('Age'))
plotPartialDependence(pdp) + scale_y_continuous(limits = c(0, NA))
Friedman, J. H. (2001). Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics, 29(5), 1189–1232. https://doi.org/10.2307/2699986
"We want to understand the model better!"
library('tidyr')
library('dplyr')
# Permutation feature importance (type = 1) from the random forest
feat.imp = getFeatureImportance(mod, type = 1)$res
dat = gather(feat.imp, key = 'Feature', value = 'Importance') %>%
    arrange(Importance)
dat$Feature = factor(dat$Feature, levels = dat$Feature)
ggplot(dat) + geom_point(aes(y = Feature, x = Importance))
Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32.
# Partial dependence for a categorical feature: Gender
pdp = generatePartialDependenceData(mod, task, features = c('Gender'))
ggplot(pdp$data) + geom_point(aes(x = Gender, y = CompensationAmount)) +
    geom_segment(aes(x = Gender, xend = Gender, yend = CompensationAmount), y = 0) +
    scale_y_continuous(limits = c(0, NA)) +
    theme(axis.text.x = element_text(angle = 10, hjust = 1))
library('lime')
# Local surrogate explanation for a single prediction (observation 3)
explanation <- lime(dat, mod)
explainer <- lime::explain(dat[3, ], explanation, n_features = 3)
plot_features(explainer, ncol = 1)
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. Retrieved from http://arxiv.org/abs/1602.04938
Read my book "Interpretable Machine Learning": https://christophm.github.io/interpretable-ml-book/